25 research outputs found
Distributed k-core view materialization and maintenance for large dynamic graphs
Cataloged from PDF version of article.In graph theory, k-core is a key metric used to identify subgraphs of high cohesion, also known as the ‘dense’
regions of a graph. As the real world graphs such as social network graphs grow in size, the contents get richer and the
topologies change dynamically, we are challenged not only to materialize k-core subgraphs for one time but also to maintain
them in order to keep up with continuous updates. Adding to the challenge is that real world data sets are outgrowing the
capacity of a single server and its main memory. These challenges inspired us to propose a new set of distributed algorithms
for k-core view construction and maintenance on a horizontally scaling storage and computing platform. Our algorithms execute
against the partitioned graph data in parallel and take advantage of k-core properties to aggressively prune unnecessary
computation. Experimental evaluation results demonstrated orders of magnitude speedup and advantages of maintaining k-core
incrementally and in batch windows over complete reconstruction. Our algorithms thus enable practitioners to create and
maintain many k-core views on different topics in rich social network content simultaneously
HetFS: A heterogeneous file system for everyone
Storage devices have been getting more and more diverse during the last decade. The advent of SSDs made it painfully clear that rotating devices, such as HDDs or magnetic tapes, were lacking in regards to response time. However, SSDs currently have a limited number of write cycles and a significantly larger price per capacity, which has prevented rotational technologies from begin abandoned. Additionally, Non-Volatile Memories (NVMs) have been lately gaining traction, offering devices that typically outperform NAND-based SSDs but exhibit a full new set of idiosyncrasies.
Therefore, in order to appropriately support this diversity, intelligent mechanisms will be needed in the near-future to balance the benefits and drawbacks of each storage technology available to a system. In this paper, we present a first step towards such a mechanism called HetFS, an extension to the ZFS file system that is capable of choosing the storage device a file should be kept in according to preprogrammed filters. We introduce the prototype and show some preliminary results of the effects obtained when placing specific files into different devices.The research leading to these results has received funding from the European
Community under the BIGStorage ETN (Project 642963 of the H2020-MSCA-ITN-2014), by the Spanish Ministry of Economy and Competitiveness under
the TIN2015-65316 grant and by the Catalan Government under the 2014-SGR-
1051 grant. To learn more about the BigStorage project, please visit http:
//bigstorage-project.eu/.Peer ReviewedPostprint (author's final draft
Graph aware caching policy for distributed graph stores
Graph stores are becoming increasingly popular among NOSQL applications seeking flexibility and heterogeneity in managing linked data. Conceptually and in practice, applications ranging from social networks, knowledge representations to Internet of things benefit from graph data stores built on a combination of relational and non-relational technologies aimed at desired performance characteristics. The most common data access pattern in querying graph stores is to traverse from a node to its neighboring nodes. This paper studies the impact of such traversal pattern to common data caching policies in a partitioned data environment where a big graph is distributed across servers in a cluster. We propose and evaluate a new graph aware caching policy designed to keep and evict nodes, edges and their metadata optimized for query traversal pattern. The algorithm distinguishes the topology of the graph as well as the latency of access to the graph nodes and neighbors. We implemented graph aware caching on a distributed data store Apache HBase in the Hadoop family. Performance evaluations showed up to 15x speedup on the benchmark datasets preferring our new graph aware policy over non-aware policies. We also show how to improve the performance of existing caching algorithms for distributed graphs by exploiting the topology information. © 2015 IEEE
Multi-resolution social network community identification and maintenance on big data platform
Community identification in social networks is of great interest and with dynamic changes to its graph representation and content, the incremental maintenance of community poses significant challenges in computation. Moreover, the intensity of community engagement can be distinguished at multiple levels, resulting in a multi-resolution community representation that has to be maintained over time. In this paper, we first formalize this problem using the k-core metric projected at multiple k values, so that multiple community resolutions are represented with multiple k-core graphs. We then present distributed algorithms to construct and maintain a multi-k-core graph, implemented on the scalable big-data platform Apache HBase. Our experimental evaluation results demonstrate orders of magnitude speedup by maintaining multi-k-core incrementally over complete reconstruction. Our algorithms thus enable practitioners to create and maintain communities at multiple resolutions on different topics in rich social network content simultaneously. © 2013 IEEE
Efficient community identification and maintenance at multiple resolutions on distributed datastores
The topic of network community identification at multiple resolutions is of great interest in practice to learn high cohesive subnetworks about different subjects in a network. For instance, one might examine the interconnections among web pages, blogs and social content to identify pockets of influencers on subjects like 'Big Data', 'smart phone' or 'global warming'. With dynamic changes to its graph representation and content, the incremental maintenance of a community poses significant challenges in computation. Moreover, the intensity of community engagement can be distinguished at multiple levels, resulting in a multi-resolution community representation that has to be maintained over time. In this paper, we first formalize this problem using the k-core metric projected at multiple k-values, so that multiple community resolutions are represented with multiple k-core graphs. Recognizing that large graphs and their even larger attributed content cannot be stored and managed by a single server, we then propose distributed algorithms to construct and maintain a multi-k-core graph, implemented on the scalable Big Data platform Apache HBase. Our experimental evaluation results demonstrate orders of magnitude speedup by maintaining multi-k-core incrementally over complete reconstruction. Our algorithms thus enable practitioners to create and maintain communities at multiple resolutions on multiple subjects in rich network content simultaneously. © 2015 Elsevier B.V. All rights reserved
Routes for breaching and protecting genetic privacy
We are entering the era of ubiquitous genetic information for research,
clinical care, and personal curiosity. Sharing these datasets is vital for
rapid progress in understanding the genetic basis of human diseases. However,
one growing concern is the ability to protect the genetic privacy of the data
originators. Here, we technically map threats to genetic privacy and discuss
potential mitigation strategies for privacy-preserving dissemination of genetic
data.Comment: Draft for comment